Multi-page document recognition in document capture

ABSTRACT

Techniques to capture document data are disclosed. It is determined that a sequence of pages in a stream of document page images comprise a single multi-page document. Data is extracted from two or more different pages included in the sequence. The data extracted from two or more different pages included in the sequence of pages is used to populate a data entry form associated with the multi-page document.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/290,453, filed Mar. 1, 2019, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE”, which is a continuation of U.S. patent application Ser. No. 15/221,433, filed Jul. 27, 2016, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE”, issued as U.S. Pat. No. 10,248,858, which is a continuation of U.S. patent application Ser. No. 13/720,671, filed Dec. 19, 2012, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE” issued as U.S. Pat. No. 9,430,453, which are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

In document capture, typically pages are recognized and validated one at a time, in sequence. In the typical approach, each page is processed independently with its own data entry form, value extraction, and validation. In the case of a multi-page document, typically the metadata document generated through document capture for each page has to be mapped to a multi-page structure and data values reconciled across pages through additional processing.

In practice, during data validation human operators typically must navigate through multiple pages and associated data entry forms, for example to compare and reconcile values that occur in different pages, etc. This approach depends on the knowledge of human operators of the location of data in different pages of a multiple page document, and in the worst case may require an operator to hunt through multiple independent pages and/or associated page-specific data entry forms to cross-validate data, for example. In addition, treating each page as a separate document results in suboptimal processing of structures such as tables, which may span multiple pages.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a flow chart illustrating an embodiment of a process to capture data.

FIG. 2 is a block diagram illustrating an embodiment of a document capture system and environment.

FIG. 3 is a block diagram illustrating an embodiment of a document capture system.

FIG. 4 is a block diagram illustrating an embodiment of a data validation user interface.

FIG. 5 is a screen shot illustrating an embodiment of a technique to minimize eye strain and/or fatigue in manual indexing.

FIG. 6 is a flow chart illustrating an embodiment of a process to facilitate manual indexing.

FIG. 7 is a block diagram illustrating an embodiment of a document capture system and process.

FIG. 8 is a block diagram illustrating an embodiment of an interface to validate a multi-page document.

FIG. 9A is a flow chart illustrating an embodiment of a process to capture document data.

FIG. 9B is a flow chart illustrating an embodiment of a process to capture document data.

FIG. 10 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.

FIG. 11 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.

FIG. 12 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Processing a multi-page document as a single entity, with a single corresponding data entry form, in an automated document capture context is disclosed. In various embodiments, the pages comprising a multi-page document are identified and associated with a multi-page document type. A corresponding data entry form is used to provide a structured representation of data extracted from the pages of the multi-page document. Structures that may span multiple pages, such as a table or list of values, are associated together in a single array or other structure of the data entry form. Validation of extracted values based on dependency fields that may occur on different pages is facilitated, both in automated processing and in human validation.

FIG. 1 is a flow chart illustrating an embodiment of a process to capture data. In the example shown, document content is captured into a digital format (102), e.g., by scanning the physical sheet(s) to create a scanned image. The document is classified (104). In some embodiments, classification includes detecting a document type corresponding to an associated data entry form. Data is extracted from the digital content (106), for example through optical character recognition (OCR) and/or optical mark recognition (OMR) techniques. Extracted data is validated (108). In various embodiments, validation may be performed at least in part by an automated process, for example by comparing multiple occurrences of the same value, by performing computations or other manipulations based on extracted data, etc. In various embodiments, all or a subset of extracted values, e.g., those for which less than a required degree of confidence is achieved through automated extraction and/or validation, may be validated manually, by a human indexer or other operator. Once all data has been validated, output is delivered (110), e.g., by storing the document image and associated data in an enterprise content management system or other repository.
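The routing of low-confidence values to manual validation, as described for step 108, might be sketched as follows. This is a minimal illustration only: the function name, the dictionary of per-field confidence scores, and the 0.9 threshold are assumptions introduced here and are not taken from the disclosure.

```python
from typing import Dict, List


def fields_for_manual_validation(
    confidences: Dict[str, float], required: float = 0.9
) -> List[str]:
    """Return the names of fields whose extraction confidence is below `required`.

    `confidences` maps a form field name to a score in [0, 1]; both the
    score scale and the default threshold are hypothetical.
    """
    return [name for name, score in confidences.items() if score < required]


# Example: only "total" falls below the assumed threshold.
example_scores = {"date": 0.99, "total": 0.62, "name": 0.90}
manual_queue = fields_for_manual_validation(example_scores)
```

Fields that meet the threshold could still be checked by automated rules; the sketch covers only the selection of fields for a human indexer.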

FIG. 2 is a block diagram illustrating an embodiment of a document capture system and environment. In the example shown, a client system 212 is attached to a scanner 204. Documents are scanned by scanner 204 and the resulting document image is sent by the client system 212 to document capture system 202 for processing, e.g., using all or part of the process of FIG. 1. In the example shown, document capture system 202 uses a library of data entry forms 206 to create a structured representation of data extracted from a scanned document. For example, as in FIG. 1 steps 104 and 106, in some embodiments a document is classified by type and an instance of a corresponding data entry form is created and populated with data values extracted from the document image. In some embodiments, data validation may be performed, at least in part, by document capture system 202 by accessing external data 208 via a network 210. For example, an external third party database that associates street addresses with correct postal zip codes may be used to validate a zip code value extracted from a document. In the example shown, validation may be performed at least in part by a plurality of manual indexers each using an associated client system 212 to communicate via network 210 with document capture system 202. For example, document capture system 202 may be configured to queue human validation tasks and to serve tasks out to indexers using clients 212. Each client system 212 may use a browser based and/or installed client software provided functionality to validate data as described herein. In some embodiments, once validation has been completed the resulting raw document image and/or form data are delivered as output, for example by storing the document image and associated form data in a repository 214, such as an enterprise content management (ECM) or other repository.

FIG. 3 is a block diagram illustrating an embodiment of a document capture system. In the example shown, the document capture system 202 of FIG. 2 is shown to receive document image data, e.g., via network 210 from a scanning client system 212. Document image data is received in some embodiments in batches and is stored in an image store 308. Document image data is provided to a data extraction module 310 which uses a data entry forms library 312 to classify each document by type and create an instance of a type-specific data entry form. Data extraction module 310 uses OCR, OMR, and/or other techniques to extract data values from the document image and uses the extracted values to populate the corresponding data entry form instance. In some embodiments, data extraction module 310 may provide a score or other indication of a degree of confidence with which an extracted value has been determined based on a corresponding portion of the document image. In some embodiments, for each data entry form field a corresponding location within the document image from which the data value entered by the extraction module in that form field was extracted, for example the portion that shows the text to which OCR or other techniques were applied to determine the text present in the image, is recorded. In the example shown, the data extraction module 310 provides the populated form to a validation module 314 configured to perform validation (automated and/or human as configured and/or required). In some embodiments, the validation module 314 applies one or more validation rules to identify fields that may require a human operator to validate. In the example shown, the validation module 314 may communicate via a communications interface 316, for example a network interface card or other communications interface, to obtain external data to be used in validation and/or to generate and provide to human indexers via associated client systems, such as one or more of clients 212 of FIG. 2, tasks to perform human/manual validation of all or a subset of form fields. The validated data is provided to a delivery/output module 318 configured to provide output via communication interface 316, for example by storing the document image and/or extracted data (structured data as captured using the corresponding data entry form) in an enterprise content management system or other repository.

FIG. 4 is a block diagram illustrating an embodiment of a data validation user interface. In the example shown, validation interface 400 includes a document image display area 402, a data entry form interface 404, and a navigation frame 406. A document image 408 is displayed in document image display area 402. In the example shown, portions of document image 408 that correspond to data entry form fields in the form shown in data entry form interface 404 are highlighted, as indicated in FIG. 4 by the cross-hatched rectangles in document image 408 as shown. In this example, thumbnails are shown in navigation frame 406, each corresponding for example to an associated document and/or page from which data has been captured. In this example, the topmost thumbnail image as shown in navigation frame 406 of FIG. 4 is highlighted (thicker outer outline), indicating that document image 408 as displayed in document image display area 402 corresponds to the topmost thumbnail. In some embodiments, controls are provided (e.g., on screen controls, keystrokes or combinations, etc.) to enable the operator to pan, scroll, and/or zoom in/out with respect to the document image 408, for example to focus and zoom in on (magnify) a particular portion of the document image 408. In some embodiments, as the operator validates each field a cursor advances to the next field and a corresponding portion of the document image 408 is highlighted.

FIG. 5 is a screen shot illustrating an embodiment of a technique to minimize eye strain and/or fatigue in manual indexing. In the example shown, partial screen shot 500 includes a portion of a manual data validation user interface that includes a data entry form field 502, in this example with a current value of “888-555-1348” displayed, and nearby to the form field, as displayed in the data entry form portion of the data validation interface, a snippet 504 taken from a corresponding document image, which shows just the portion of the document image that contains the image of the text (in this case numerical values) extracted from the document to populate the form field 502. In this example, a confirmation or other informational and/or error message 506 similarly is displayed near the form field 502. As a result, the form field 502, corresponding snippet 504, and confirmation message 506 are all in the line of sight, or nearly so, at the same time, enabling all information required to validate the value entered in the form field 502, including entering any correction that may be required, to be viewed at the same time and/or with minimal eye or head movement and without requiring the operator to scan back and forth between the document image frame and the data entry form, and/or to scroll, pan, or zoom in/out in the document image as viewed to locate and scale to a readable size the text to be validated. In some embodiments, the snippet 504 is scaled to ensure readability, for example by including in the snippet only (or mostly) the text to be validated and scaling the image to a readable size, for example until the image is of at least a prescribed minimum size and/or the displayed characters are of a prescribed minimum “point” or other size.

In some embodiments, as an operator finishes validation of a field, indicated for example by pressing the “enter” key or selecting another key or on screen control, the system automatically pans to the next data entry form field, and retrieves and displays near the form field a corresponding document image snippet. In this way, the operator can navigate through the form and corresponding portions of the document image without retargeting, i.e., without having to redirect their eyes to a different point or points on the screen.

FIG. 6 is a flow chart illustrating an embodiment of a process to facilitate manual indexing. In various embodiments, the process of FIG. 6 is used to provide an interface such as the one shown and described above in connection with FIG. 5. In the example shown in FIG. 6, a snippet containing the text or other document image portion corresponding to a data entry form field to be validated is obtained, and an association between the snippet and/or the associated location in the document image, on the one hand, and the corresponding form field, on the other hand, is stored (602). The snippet is scaled as/if needed for readability (604). The scaled (if applicable) snippet is displayed adjacent or otherwise near to the form field where corresponding extracted data to be validated is displayed and/or entered (606).
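Steps 602 and 604 could be sketched as follows, assuming a snippet is recorded as a rectangle on a page image and that readability is judged by a minimum pixel height. The `Snippet` class, its attribute names, and the 24-pixel default are hypothetical choices made for illustration; the disclosure does not prescribe a particular representation or size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Snippet:
    """Association between a form field and a region of a page image (602)."""
    field_name: str
    page_index: int
    x: int
    y: int
    width: int
    height: int


def scale_for_readability(snippet: Snippet, min_height_px: int = 24) -> Tuple[int, int]:
    """Return the (width, height) at which to display the snippet (604).

    The snippet is enlarged, preserving aspect ratio, only if it is shorter
    than the assumed minimum height; otherwise its size is left unchanged.
    """
    if snippet.width <= 0 or snippet.height <= 0:
        raise ValueError("snippet must have positive dimensions")
    if snippet.height >= min_height_px:
        return snippet.width, snippet.height
    factor = min_height_px / snippet.height
    return round(snippet.width * factor), min_height_px


# A 120x12 region is doubled to 240x24 under the assumed minimum height.
phone = Snippet("phone", page_index=0, x=40, y=310, width=120, height=12)
display_size = scale_for_readability(phone)
```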

Typically, as noted above, pages comprising a multiple page document have been processed separately, each page having its own corresponding electronic data entry form associated with it. The per-page form approach has a number of shortcomings. For example, a value (e.g., an account number on the footer of each page in an invoice) may occur in several pages. An error on a single page results in work for the operator, because typically there is no framework to reconcile data across pages and to auto-correct data. In addition, in production, the operator will only become aware of the problem when he navigates to the page. If there are large discrepancies between values of many pages, then the operator must manually look at each page and that takes time.

In semi-structured and unstructured documents, there can be any number of variations of pages. If the data-entry form is page-based, and a unique form per page is used, this results in an unmanageable number of forms. If a generic form that contains a union of possible fields is used, this results in forms with unused fields. This requires extra work to handle. Furthermore, if a value is copied from another page, its source value and location typically is not shown because only the current page, and not the page from which the copied value was extracted, is shown. If the page is changed, it would result in the data-entry form being changed. Changing the data-entry form then disrupts the sequence of work, resulting in lower operator efficiency.

Under the form-per-page approach, when a table spans multiple pages, the technique of copying data between pages results in a large set of duplicate values. Extra effort is then needed to synchronize if the user makes any changes on any page. The navigation problem described above is compounded. For example, suppose the sub-totals on a multi-page invoice line items table do not add up. It is more cumbersome for the operator to go through each page and then each table, and to work with duplicate row values.

In content management systems, the metadata object is usually not defined per page. To export per-page forms, effort must be made to map values to their corresponding attributes of a metadata object used to represent the multi-page document in the content management system.

In light of all the foregoing shortcomings of the per-page approach to document capture as applied to multi-page documents, automatic detection and processing of a multi-page document as a single document is disclosed. In various embodiments, automatic detection of the pages comprising a multiple page document is performed. Data values are extracted from the pages comprising the document and used to populate a single electronic data entry form for the multi-page document. The operator can then go through the electronic data entry form, for example to validate data fields as required, and the document capture and/or validation system shows the location in the captured document of the corresponding data, regardless of which page(s) it occurs in, rather than the operator having to find and/or choose each page, indexing each independently, and then reconcile later data that occurs in and/or spans multiple pages.

FIG. 7 is a block diagram illustrating an embodiment of a document capture system and process. In the example shown, scanned pages 702, 704, and 706 comprise a multi-page document. First page recognition and/or other techniques are applied in various embodiments to detect automatically the beginning and/or ending of a multi-page document. The pages 702, 704, and 706 are identified through a process 708 as comprising a single multi-page document. A corresponding document type is determined and data values are extracted from pages 702, 704, and 706 to populate a single data entry form 710 configured to capture data values extracted from the multi-page document. In the example shown, the respective locations within the page images 702, 704, and 706 of data extracted to populate form 710 are shown as small cross-hatched rectangles. The rows at the bottom of page 702 and the top of page 704 in this example comprise a single table, list, or other array that spans pages 702 and 704. The corresponding extracted data values are in some embodiments captured initially in page specific arrays, the rows of which are concatenated in the example shown to populate the single table at the bottom of form 710.

FIG. 8 is a block diagram illustrating an embodiment of an interface to validate a multi-page document. In the example shown, the interface 800 includes a page image display area 802, in which in the example shown an image of page 702 of FIG. 7 is shown. The interface 800 further includes a data entry form area 804, in this example corresponding to the form 710 of FIG. 7. Thumbnails for the pages 702, 704, and 706 of FIG. 7 (not numbered individually in FIG. 8) are displayed in navigation pane 806. In the example shown, the topmost thumbnail as displayed in navigation pane 806 is highlighted as being currently “selected” for display in page image display area 802. In various embodiments, selection by a human operator of a thumbnail in navigation pane 806 results in an image of the corresponding page being displayed in page image display area 802. In some embodiments, as an operator navigates to different form fields in the form area 804, a corresponding portion or portions of the multi-page document, in one or more pages, may be navigated to automatically. For example, navigation to a first row of the three column table at the bottom of the form in this example may in some embodiments cause the first page 702 to be displayed. Selection of a cell in one of the bottom three rows, either manually or automatically as the system advances to a next field to be validated, in various embodiments may cause the second page of the multi-page document, from which the corresponding data was extracted in this example, to be displayed in the page image display area 802. In various embodiments, selection of a field in form area 804 results in a snippet of a corresponding portion of the page from which the data associated with that field was extracted being determined, retrieved, and displayed, for example in a location adjacent or nearly adjacent to the field, as described above.

FIG. 9A is a flow chart illustrating an embodiment of a process to capture document data. In the example shown, the beginning and/or end of a multi-page document is/are detected (902). For example, known techniques to detect a first page may be used, and a multi-page document may be determined to have been encountered if one or more subsequent pages are scanned before a next “first” page is detected. A document type is determined and a corresponding data entry form instance is created (904). Scalar (single value) and array (tables, lists, or other two dimensional sets of data) data values are identified and extracted, for example using OCR, OMR, or other automated extraction techniques (906). Occurrences of the same and/or dependent values in multiple locations, including across page boundaries, may be used to perform automated and/or manual validation (908). For example, a name that appears at the beginning of a life insurance application and again in an attached report of a physical examination may be cross-checked to determine the accuracy of data extraction from one or both of the locations. Rows of arrays that span multiple pages are concatenated into a single form table (910). Array values may be validated using the full table, including across page boundaries (912). For example, quantity and unit price fields may be multiplied and the result compared to a line item subtotal, subtotals in all rows (including potentially across page boundaries) may be summed and compared to an extracted total, etc.
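Steps 910 and 912 might be sketched as follows. The row layout (dictionaries with `quantity`, `unit_price`, and `subtotal` keys), the function names, and the use of `Decimal` for currency are assumptions made for illustration; the source page of each row is retained so that an error can be traced back to its page image.

```python
from decimal import Decimal
from typing import Dict, List


def concatenate_rows(page_tables: List[List[Dict]]) -> List[Dict]:
    """Concatenate per-page row lists into one table, tagging each row with its page (910)."""
    rows = []
    for page_index, page_rows in enumerate(page_tables):
        for row in page_rows:
            rows.append({**row, "page": page_index})
    return rows


def validate_table(rows: List[Dict], extracted_total: Decimal) -> List[str]:
    """Check line items and the overall total across page boundaries (912)."""
    errors = []
    for i, row in enumerate(rows):
        if row["quantity"] * row["unit_price"] != row["subtotal"]:
            errors.append(f"row {i} (page {row['page']}): quantity x unit price != subtotal")
    if sum((r["subtotal"] for r in rows), Decimal("0")) != extracted_total:
        errors.append("subtotals do not sum to the extracted total")
    return errors


# A table whose rows were extracted from two consecutive pages.
page_1 = [{"quantity": Decimal("2"), "unit_price": Decimal("3.50"), "subtotal": Decimal("7.00")}]
page_2 = [{"quantity": Decimal("1"), "unit_price": Decimal("4.25"), "subtotal": Decimal("4.25")}]
table = concatenate_rows([page_1, page_2])
problems = validate_table(table, Decimal("11.25"))
```

With the values above no rule fails, so `problems` is empty; changing the extracted total would flag the table as a whole rather than any single page.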

FIG. 9B is a flow chart illustrating an embodiment of a process to capture document data. In the example shown, a library of metadata document types is defined, with each document type containing scalar fields and tables of array fields (922). Automatic page recognition is done as in prior art, with page types determined (924). Values are extracted into per-page scalar and array fields by name, and each field's location on the page is saved (926). The multi-page document type is determined from an analysis of the stream of page types (928). Data from the component pages is automatically combined into the document type (930). A given named scalar field may occur on any page, or in multiple pages. Data validation is performed (932).
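One simple way steps 928 and 930 could work is sketched below, under the assumption that certain page types are known to begin a document. The page type names, the function names, and the choice to keep every occurrence of a named scalar field together with its page index are illustrative assumptions rather than details of the disclosed embodiments.

```python
from typing import Dict, List, Set, Tuple


def split_into_documents(page_types: List[str], first_page_types: Set[str]) -> List[List[int]]:
    """Group a stream of page types into documents (928).

    A new document starts at each page whose type is a known "first page"
    type; following pages are attached to the current document.
    """
    documents: List[List[int]] = []
    for index, page_type in enumerate(page_types):
        if page_type in first_page_types or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


def combine_scalar_fields(pages: List[Dict[str, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Combine per-page scalar fields by name, keeping the page each value came from (930)."""
    combined: Dict[str, List[Tuple[int, str]]] = {}
    for page_index, fields in enumerate(pages):
        for name, value in fields.items():
            combined.setdefault(name, []).append((page_index, value))
    return combined


stream = ["invoice_first", "invoice_cont", "invoice_cont", "letter", "invoice_first"]
groups = split_into_documents(stream, {"invoice_first", "letter"})

per_page = [{"account": "123", "date": "2012-12-19"}, {"account": "128"}]
merged = combine_scalar_fields(per_page)
```

Keeping all occurrences of a field such as `account` side by side makes a cross-page mismatch (here "123" versus "128") visible to a later validation step.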

FIG. 10 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, an indication is received that an operator is done validating a currently displayed data field (1002), e.g., the “Date” field in the example shown in FIG. 8. If no more fields remain to be validated (1004), the process ends. Otherwise, if the next field to be validated is on the same page (1006) the next field in the data entry form is advanced to and displayed, and a corresponding snippet or other portion of the current page, from which the associated data value to be validated was extracted, is displayed adjacent to the form field (1008). If the next form field requiring validation is associated with data from a different page of the multi-page document (1006), the system automatically retrieves or otherwise accesses the other page and/or an applicable portion thereof (e.g., a corresponding snippet) (1010), transparently to the human operator, and the next form field and the corresponding snippet obtained from the other page of the multi-page document are displayed for validation (1008) transparently to and without requiring any further action by the human operator.
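The decision logic of steps 1004 through 1010 might look like the following sketch. The `FormField` class and the `advance` function are hypothetical names; the returned flag merely indicates whether another page image would need to be retrieved before the snippet can be shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FormField:
    name: str
    page_index: int          # page from which the value was extracted
    needs_validation: bool = True


def advance(
    fields: List[FormField], current_index: int, current_page: int
) -> Optional[Tuple[FormField, bool]]:
    """Find the next field to validate after `current_index`.

    Returns None when no fields remain (1004). Otherwise returns the next
    field and a flag that is True when its data lives on a different page,
    so that the other page must be retrieved (1010) before display (1008).
    """
    for form_field in fields[current_index + 1:]:
        if form_field.needs_validation:
            return form_field, form_field.page_index != current_page
    return None


form_fields = [
    FormField("date", page_index=0),
    FormField("name", page_index=0, needs_validation=False),
    FormField("row4_quantity", page_index=1),
]
next_step = advance(form_fields, current_index=0, current_page=0)
```

In this example the already-validated "name" field is skipped and the next field comes from the second page, so the flag is True.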

FIG. 11 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, a definition of a library of validation rules is received (1102). Examples include, without limitation, a rule requiring that a first value extracted from a named field A must match a second value extracted from a named field B. Another example is a rule requiring that a sum or other computation based on a specified set of fields must equal a value extracted from another named field, e.g., subtotals in an array must sum to equal a total. Document type definitions are received (1104). Each definition identifies validation rules to be applied, and as applicable a mapping to the document type fields to be used to apply each rule. An operator interface is provided that facilitates multi-field validation, including across page boundaries (1106). In some embodiments, for example, the interface enables an operator to iterate through just the dependent fields that require validation. As the operator corrects and/or confirms the entered value for a first dependent field, for example, and hits “enter”, the system advances automatically to display a next one of the dependent fields and its associated document image portion, from whichever page it may be located on. The system iterates through the dependent fields until the operator enters data that clears the validation error and/or there are no more dependent fields to be displayed. In various embodiments, by combining data extracted from multiple page images of a multi-page document into a single document type and associated data entry form, automated and manual validation of dependent data fields that occur on or across different pages is facilitated, without requiring software code and/or human action to navigate between data entry forms used to capture data extracted from individual pages.
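A rule library (1102) and a document type definition that maps rules to fields (1104) could be represented as in the sketch below. The rule names, the `DOCUMENT_TYPE` structure, and the field names are all hypothetical; they only illustrate the two example rules given above (a match rule and a sum rule).

```python
from typing import Callable, Dict, List, Sequence

Rule = Callable[[Sequence], bool]

# Library of reusable validation rules (1102); names are illustrative.
RULES: Dict[str, Rule] = {
    # The first value must match the second value.
    "must_match": lambda values: values[0] == values[1],
    # All values but the last must sum to the last value.
    "sum_equals_last": lambda values: sum(values[:-1]) == values[-1],
}

# Document type definition (1104): which rules apply, mapped to named fields.
DOCUMENT_TYPE = {
    "name": "invoice",
    "validations": [
        ("must_match", ["header_account", "footer_account"]),
        ("sum_equals_last", ["subtotal_1", "subtotal_2", "total"]),
    ],
}


def fields_with_errors(document_type: dict, form: dict) -> List[str]:
    """Return, in order, the dependent fields involved in any failed rule."""
    failed: List[str] = []
    for rule_name, field_names in document_type["validations"]:
        values = [form[name] for name in field_names]
        if not RULES[rule_name](values):
            failed.extend(n for n in field_names if n not in failed)
    return failed


form = {
    "header_account": "A-1",   # e.g., extracted from the first page
    "footer_account": "A-7",   # e.g., extracted from a later page
    "subtotal_1": 10,
    "subtotal_2": 5,
    "total": 15,
}
to_review = fields_with_errors(DOCUMENT_TYPE, form)
```

The resulting list contains only the dependent fields of the failed match rule, which is the set an operator interface (1106) could iterate through.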

FIG. 12 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, an instance of a multi-page document type is received (1202). Applicable validation rules are evaluated, e.g., sequentially, including those requiring the concurrent processing of data values extracted from different pages, and any dependent fields are marked as having an error if a rule fails (1204). If during validation by a human operator (see, e.g., FIG. 10) a data value as extracted is corrected, e.g., the human operator enters a corrected value in a form field, validation rules affected by the change are re-evaluated, for example to ensure a correction that satisfied a first rule did not introduce an inconsistency that caused a second rule to not be satisfied. If the value for a field is visually confirmed with the page image to be correct, then the field can be flagged so it is henceforth no longer marked as having an error when a rule is re-evaluated (1204). In this way, operators can be more efficient by navigating only to unconfirmed fields.
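The re-evaluation behavior described above, including the exemption of visually confirmed fields, might be sketched as follows. The representation of a rule as a (predicate, field names) pair and the `confirmed` set are assumptions introduced for this example.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

RuleSpec = Tuple[Callable[..., bool], Tuple[str, ...]]


def evaluate(rules: Iterable[RuleSpec], form: Dict[str, str], confirmed: Set[str]) -> Set[str]:
    """Evaluate every rule and return the fields currently marked as in error.

    Fields the operator has visually confirmed against the page image are
    never marked again, even if a rule that depends on them still fails.
    """
    in_error: Set[str] = set()
    for predicate, field_names in rules:
        if not predicate(*(form[name] for name in field_names)):
            in_error.update(n for n in field_names if n not in confirmed)
    return in_error


# One rule: the name on page 1 must match the name on page 3.
rules = [(lambda a, b: a == b, ("name_p1", "name_p3"))]
values = {"name_p1": "Jane Doe", "name_p3": "Jane Doc"}

initial = evaluate(rules, values, confirmed=set())            # both fields marked
after_confirm = evaluate(rules, values, confirmed={"name_p1"})  # only name_p3 remains

corrected = dict(values, name_p3="Jane Doe")                  # operator corrects the value
after_fix = evaluate(rules, corrected, confirmed={"name_p1"})   # rule passes, nothing marked
```

This sketch re-runs all rules after each change for simplicity; re-evaluating only the rules affected by the changed field, as the text describes, would require an index from fields to rules.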

In various embodiments, human operator validation of errors involving fields that have dependency relationships with other fields, such as a “name” value that occurs in more than one page of a multi-page document, is facilitated by displaying the fields together, in a single screen, along with each field's corresponding document image snippet, even if the snippets come from different pages. Likewise, as an operator iterates through error fields in a table or other two dimensional data structure, corresponding snippets are displayed, even if they come from multiple, different pages. The human operator need only navigate through fields in a single data entry form, and the system transparently retrieves and displays for each field its corresponding snippet or other partial image, without regard to page boundaries.

Using techniques described herein, multi-page documents can be processed more efficiently in the document capture context. Values in the same document can be reconciled and either auto-corrected or flagged for manual confirmation without switching between documents or data entry forms, copying over data values from one form to the other, etc. This facilitates use of data redundancy found in many document images. In addition, the data entry form is abstracted from its page definition. The operator does not have to worry about where a value is, which enables the operator to focus on validating data on the form. If there are variations in page versions, the operator does not have to worry about it. The location logic will find the right place. In addition, the developer and operator do not have to incur the cost and complexity of copying data back and forth between page forms. Array data is shown in one table, rather than in a table per page, thereby improving the user experience. Finally, it is easier to map content management metadata objects to new document types, since all of the extracted data values for and structure of a multi-page document are captured in one form.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
 1. A method of capturing document data, comprising:obtaining a multi-page document; extracting data from multiple pages ofthe multi-page document; identifying two or more values located on atleast two different pages of the multi-page document, wherein at least afirst one of the two or more values is dependent on at least a secondone of the two or more values; and validating the at least the first oneof the two or more values and the at least the second one of the two ormore values according to one or more validation rules.
 2. The method ofclaim 1, wherein the validating comprises: accessing a set of validationrules in a library of validation rules; sequentially applying the set ofvalidation rules to the at least the first one of the two or more valuesand the at least the second one of the two or more values; and markingthe at least the first one of the two or more values and the at leastthe second one of the two or more values as having an error if one ofthe validation rules fails.
 3. The method of claim 2, further comprising receiving a document type definition corresponding to the multi-page document, wherein the document type definition identifies the set of validation rules to be applied to the at least the first one of the two or more values and the at least the second one of the two or more values.
 4. The method of claim 3, wherein the document type definition includes a mapping to document type fields to be used to apply each rule.
 5. The method of claim 4, further comprising identifying a document type of the multi-page document, and identifying the document type definition based on the identified document type of the multi-page document.
 6. The method of claim 5, wherein the document type contains one or more scalar fields and one or more tables of array fields.
 7. The method of claim 6, further comprising extracting values from each page into per-page scalar and array fields by name.
 8. The method of claim 7, wherein for each extracted value, a corresponding location on the page from which the value was extracted is saved.
 9. The method of claim 6, further comprising combining data extracted from the respective pages into a form associated with the document type, wherein combining data extracted from the respective pages into a form associated with the document type includes forming an array that spans multiple pages concatenating a first set of rows of values extracted from a first page with a second set of rows of values extracted from a second page to create a combined set of rows to be included in the document type.
 10. The method of claim 1, wherein the validating comprises providing one or more of the at least the first one of the two or more values and the at least the second one of the two or more values to a user for manual validation.
 11. The method of claim 10, further comprising presenting an interface to the user, wherein the interface displays to the user a plurality of fields which are marked as having errors and enables the user to iterate through the plurality of fields, wherein the plurality of fields displayed to the user include only dependent fields that require validation.
 12. The method of claim 11, further comprising identifying a document type of the multi-page document, and creating an instance of a selected one of a plurality of type-specific data entry forms in a forms library based at least in part on the identified document type of the multi-page document, wherein the plurality of fields displayed to the user are fields contained in the created instance.
 13. The method of claim 12, wherein the two or more values presented to the user for manual validation are identified by determining whether the first one of the two or more values matches the second one of the two or more values in the multi-page document.
 14. The method of claim 11, wherein the interface is configured to repetitively iterate through each of the plurality of fields until either the user enters data that clears the validation error associated with the field.
 15. The method of claim 11, further comprising populating, based at least in part on the data extracted from the multi-page document, a plurality of fields of the instance of the selected data entry form including the plurality of fields displayed to the user.
 16. The method of claim 11, wherein as each form field is displayed, a corresponding snippet or other partial image from a page from which a current data value associated with the form field was extracted is displayed adjacent to the field.
 17. The method of claim 1, wherein the first and second ones of the two or more values are contained in a table or array of the multi-page document.
 18. The method of claim 1, further comprising determining that a sequence of pages comprise the multi-page document, wherein determining that a sequence of pages in a stream of document page images comprise the single multi-page document includes processing each page individually to determine a corresponding page type; and processing the stream of page types to identify a sequence associated with a multi-page document type.
 19. A document capture system, comprising: a communication or other interface configured to receive a multi-page document; and one or more processors coupled to the interface and configured to: obtain a multi-page document; extract data from multiple pages of the multi-page document; identify two or more values located on at least two different pages of the multi-page document, wherein at least a first one of the two or more values is dependent on at least a second one of the two or more values; and validate the at least the first one of the two or more values and the at least the second one of the two or more values according to one or more validation rules.
 20. A computer program product to capture document data, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for: obtaining a multi-page document; extracting data from multiple pages of the multi-page document; identifying two or more values located on at least two different pages of the multi-page document, wherein at least a first one of the two or more values is dependent on at least a second one of the two or more values; and validating the at least the first one of the two or more values and the at least the second one of the two or more values according to one or more validation rules.